Conversation
- Implemented HumanReview component for entity validation and golden set creation.
- Created Sampling component to configure sample size and method for evaluation.
- Developed Setup component for dataset selection and compliance framework configuration.
- Established routing for new pages including Setup, Sampling, Human Review, and Evaluation.
- Defined types for datasets, entities, and evaluation metrics.
- Set up main application entry point and integrated styles using Tailwind CSS.
- Configured Vite for development with React and Tailwind CSS support.
…tyComparison component
Dependency Review

The following issues were found:
* feat: Enhance Human Review and Setup pages with dataset handling and auto-accept functionality
  - Updated HumanReview component to include a "Skip Tagging" button that auto-accepts all entities from records.
  - Integrated session storage for setup configuration in HumanReview.
  - Modified Setup component to allow loading datasets from CSV/JSON files with a preview feature.
  - Added new types for UploadedDataset and SetupConfig to manage dataset metadata.
  - Implemented backend API for loading datasets, including CSV and JSON parsing.
  - Created sample medical records dataset for testing and demonstration purposes.
* feat: Implement auto-confirm all functionality in Human Review page
```python
file_path = os.path.expanduser(req.path)
if not os.path.isabs(file_path):
    raise HTTPException(status_code=400, detail="Path must be absolute.")
if not os.path.isfile(file_path):
```
Check failure: Code scanning / CodeQL, "Uncontrolled data used in path expression" (High)

Copilot Autofix (AI, 20 days ago):
In general, the problem is that user-controlled data (req["path"]) is used directly as a filesystem path. To fix this, we must validate and constrain the path before calling os.path.isfile and open. A common and appropriate strategy here is to define a safe root directory (for example, the existing _PROJECT_ROOT or a dedicated subdirectory under it), normalize the user-supplied path with os.path.realpath or os.path.normpath relative to that root, and then check that the normalized path is actually within the root (e.g., using os.path.commonpath). Only then should we allow file access.
The best fix that preserves existing functionality while making it safe is:
- Interpret the incoming `req["path"]` as a relative path under a safe root (e.g., `_PROJECT_ROOT`), not as an arbitrary absolute filesystem path.
- Normalize the joined path using `os.path.realpath` or `os.path.normpath`.
- Use `os.path.commonpath([safe_root, normalized]) == safe_root` to ensure the resulting path does not escape the root via `..` or symlinks.
- If validation fails, return 400 with an explanatory message.
- Then use the validated path for `os.path.isfile` and `open`.
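As a rough, self-contained illustration of the containment rule described above (the helper name `is_within_root` is invented for this sketch and is not part of the codebase):

```python
import os


def is_within_root(root: str, user_path: str) -> bool:
    """Return True only if user_path, resolved under root, stays inside root."""
    root = os.path.realpath(root)
    # Join the user-supplied value as a relative path, then resolve ".." and symlinks
    candidate = os.path.realpath(os.path.join(root, user_path))
    try:
        return os.path.commonpath([root, candidate]) == root
    except ValueError:
        # Raised on Windows when the paths are on different drives
        return False


print(is_within_root("/tmp/project", "data/records.csv"))  # stays inside the root
print(is_within_root("/tmp/project", "../../etc/passwd"))  # traversal attempt, rejected
```

Note that `os.path.realpath` works on non-existent paths too, so the check can run before any filesystem access.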
To implement this in evaluation/ai-assistant/backend/routers/upload.py:
- Reuse the existing `_PROJECT_ROOT` constant as the safe root, or introduce a dedicated `_ALLOWED_CSV_ROOT` that points somewhere under `_PROJECT_ROOT` (e.g., `os.path.join(_PROJECT_ROOT, "data")`). We'll reuse `_PROJECT_ROOT` since it already exists and no new imports are necessary.
- Update `get_csv_columns_from_path`:
  - Read `raw_path` from `req["path"]`.
  - Reject empty or purely whitespace paths.
  - Disallow path separators that would indicate attempts to pass an absolute path directly; instead, treat the value as relative and always join against `_PROJECT_ROOT`.
  - Compute `candidate = os.path.realpath(os.path.join(_PROJECT_ROOT, raw_path))`.
  - Check `os.path.commonpath([_PROJECT_ROOT, candidate]) == _PROJECT_ROOT`; if not, reject.
  - Use `candidate` in the subsequent `os.path.isfile` and `open` calls.
This keeps behavior close to the original intent (read a CSV-like file accessible to the backend) but ensures only files under the repository root can be read, and prevents path traversal or arbitrary absolute-path access. No new imports or external dependencies are required.
```diff
@@ -264,13 +264,22 @@
 @router.post("/columns-from-path")
 async def get_csv_columns_from_path(req: dict):
-    """Read the header row of a CSV at the given absolute path."""
-    file_path = os.path.expanduser(req.get("path", ""))
-    if not file_path or not os.path.isabs(file_path):
-        raise HTTPException(status_code=400, detail="Path must be absolute.")
-    if not os.path.isfile(file_path):
-        raise HTTPException(status_code=400, detail=f"File not found: {file_path}")
-    with open(file_path, encoding="utf-8") as f:
+    """Read the header row of a CSV at the given path under the project root."""
+    raw_path = (req.get("path") or "").strip()
+    if not raw_path:
+        raise HTTPException(status_code=400, detail="Path is required.")
+    # Resolve the user-supplied path against the project root and ensure it stays within it.
+    candidate_path = os.path.realpath(os.path.join(_PROJECT_ROOT, raw_path))
+    try:
+        common = os.path.commonpath([_PROJECT_ROOT, candidate_path])
+    except ValueError:
+        # Different drives or invalid paths
+        raise HTTPException(status_code=400, detail="Invalid path.")
+    if common != _PROJECT_ROOT:
+        raise HTTPException(status_code=400, detail="Access to this path is not allowed.")
+    if not os.path.isfile(candidate_path):
+        raise HTTPException(status_code=400, detail=f"File not found: {raw_path}")
+    with open(candidate_path, encoding="utf-8") as f:
         head = f.read(65_536)
         reader = csv.DictReader(io.StringIO(head))
         columns = list(reader.fieldnames or [])
```
```python
if not os.path.isfile(file_path):
    raise HTTPException(status_code=400, detail=f"File not found: {file_path}")

file_size = os.path.getsize(file_path)
```
Check failure: Code scanning / CodeQL, "Uncontrolled data used in path expression" (High)

Copilot Autofix (AI, 20 days ago):
In general, the problem is fixed by restricting user-controlled paths to a well-defined safe root directory and validating the normalized path before using it, or by otherwise constraining the allowed files (for example, an allow list). For this backend, there is already a _DATA_DIR directory intended for managed datasets; the safest and least invasive fix is to ensure that any path passed to /load and /columns-from-path is resolved relative to _DATA_DIR (or another chosen safe root) and that the normalized final path is checked to be inside this directory before any filesystem operations.
Concretely, we can introduce a helper _resolve_safe_path that: (1) takes the user-provided path (which may be absolute or relative), (2) expands ~, (3) if it is absolute, strips the leading path separator and treats it as relative to _DATA_DIR rather than the filesystem root, (4) joins this with _DATA_DIR, (5) normalizes it with os.path.normpath, and (6) checks that the resulting path starts with the _DATA_DIR prefix (using a robust prefix check that avoids partial-directory matches). If the check fails, we raise HTTPException(400, "Path not allowed."). We then use this helper for both get_csv_columns_from_path (around line 265) and load_dataset (around line 288) instead of directly using os.path.expanduser and absolute-path checks. This preserves existing functionality to load arbitrary CSV/JSON files within the project’s data directory while preventing access to other filesystem locations.
To implement this, we will:
- Add a new helper `_resolve_safe_path` near the existing `_resolve_path` helper.
- Update `get_csv_columns_from_path` to call `_resolve_safe_path(req.get("path", ""))`, remove the absolute-path requirement, and keep the existing checks for existence and file type.
- Update `load_dataset` to call `_resolve_safe_path(req.path)` and reuse existing size and format validations.

No new external libraries are needed, and we only rely on `os.path`, which is already imported.
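The planned behavior (absolute input reinterpreted as relative to the data directory, then a `normpath` containment check) can be exercised as a framework-free sketch; `resolve_safe` is an illustrative stand-in for the `_resolve_safe_path` helper and raises `ValueError` instead of `HTTPException`, and the `DATA_DIR` value is assumed:

```python
import os

DATA_DIR = "/srv/app/backend/data"  # assumed location of the managed data directory


def resolve_safe(user_path: str) -> str:
    """Resolve user_path to a location under DATA_DIR or raise ValueError."""
    if not user_path:
        raise ValueError("Path is required.")
    expanded = os.path.expanduser(user_path)
    # Absolute inputs are reinterpreted as relative to the data directory root
    if os.path.isabs(expanded):
        expanded = expanded.lstrip(os.sep)
    candidate = os.path.normpath(os.path.join(DATA_DIR, expanded))
    root = os.path.normpath(DATA_DIR)
    if not (candidate == root or candidate.startswith(root + os.sep)):
        raise ValueError("Path not allowed.")
    return candidate


print(resolve_safe("records.csv"))   # lands under DATA_DIR
print(resolve_safe("/etc/passwd"))   # reinterpreted as DATA_DIR/etc/passwd
```

A traversal attempt such as `resolve_safe("../../secret")` normalizes to a path outside the root and is rejected.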
```diff
@@ -81,6 +81,32 @@
     return os.path.normpath(os.path.join(_PROJECT_ROOT, path))


+def _resolve_safe_path(user_path: str) -> str:
+    """
+    Resolve a user-provided path to a location under the managed data directory.
+
+    The returned path is guaranteed to be contained within ``_DATA_DIR`` or an
+    HTTPException is raised.
+    """
+    if not user_path:
+        raise HTTPException(status_code=400, detail="Path is required.")
+
+    expanded = os.path.expanduser(user_path)
+
+    # Treat absolute paths as paths relative to the data directory root
+    if os.path.isabs(expanded):
+        expanded = expanded.lstrip(os.sep)
+
+    candidate = os.path.normpath(os.path.join(_DATA_DIR, expanded))
+
+    data_dir_norm = os.path.normpath(_DATA_DIR)
+    # Ensure the candidate path is inside _DATA_DIR
+    if not (candidate == data_dir_norm or candidate.startswith(data_dir_norm + os.sep)):
+        raise HTTPException(status_code=400, detail="Path not allowed.")
+
+    return candidate
+
+
 def _ensure_stored_copy(dataset_id: str) -> UploadedDataset:
     """Ensure a dataset has a managed copy under backend/data."""
     ds = _uploaded.get(dataset_id)
@@ -264,10 +290,9 @@
 @router.post("/columns-from-path")
 async def get_csv_columns_from_path(req: dict):
-    """Read the header row of a CSV at the given absolute path."""
-    file_path = os.path.expanduser(req.get("path", ""))
-    if not file_path or not os.path.isabs(file_path):
-        raise HTTPException(status_code=400, detail="Path must be absolute.")
+    """Read the header row of a CSV at the given path under the data directory."""
+    raw_path = req.get("path", "")
+    file_path = _resolve_safe_path(raw_path)
     if not os.path.isfile(file_path):
         raise HTTPException(status_code=400, detail=f"File not found: {file_path}")
     with open(file_path, encoding="utf-8") as f:
@@ -287,7 +312,7 @@
 @router.post("/load")
 async def load_dataset(req: DatasetLoadRequest):
-    """Load a CSV or JSON file from a local absolute path."""
+    """Load a CSV or JSON file from a local path under the data directory."""
     if req.format not in ("csv", "json"):
         raise HTTPException(
             status_code=400,
@@ -297,9 +322,7 @@
         ),
     )

-    file_path = os.path.expanduser(req.path)
-    if not os.path.isabs(file_path):
-        raise HTTPException(status_code=400, detail="Path must be absolute.")
+    file_path = _resolve_safe_path(req.path)
     if not os.path.isfile(file_path):
         raise HTTPException(status_code=400, detail=f"File not found: {file_path}")
```
```python
if file_size > MAX_FILE_SIZE:
    raise HTTPException(status_code=400, detail="File too large (max 50 MB)")

with open(file_path, encoding="utf-8") as f:
```
Check failure: Code scanning / CodeQL, "Uncontrolled data used in path expression" (High)

Copilot Autofix (AI, 20 days ago):
In general, to fix uncontrolled path usage, you should constrain file access to a known-safe directory (or set of directories) and validate any user-supplied path against that root. This typically involves: resolving ~ and relative segments, normalizing with os.path.realpath or os.path.normpath, then checking that the resulting path is within the allowed root. Deny access if the path escapes that root or is not a regular file.
In this code, load_dataset currently requires an absolute path and permits access anywhere. The least-disruptive fix that keeps the same general functionality but removes the vulnerability is:
- Define a dedicated root directory for loadable datasets (for example, reuse `_DATA_DIR` as the allowed root) or a new `_ALLOWED_DATA_ROOT`.
- Update `load_dataset` to:
  - Expand the user path (`os.path.expanduser`) and normalize/resolve the candidate path using `os.path.realpath`.
  - Join relative inputs to the allowed root if you decide to support non-absolute paths, or continue to require absolute paths but still enforce containment.
  - Verify that the resolved path is under the allowed root using a robust prefix check such as `os.path.commonpath([_ALLOWED_DATA_ROOT, resolved_path]) == _ALLOWED_DATA_ROOT`.
  - Optionally, reject paths that are not regular files.
- Use this validated `file_path` for `os.path.isfile`, `os.path.getsize`, and `open`.
To avoid assuming anything outside the snippet, we can introduce a new _ALLOWED_DATA_ROOT constant alongside _DATA_DIR at the top of the file and a small helper _validate_and_resolve_user_path near _resolve_path or directly in load_dataset. Since os is already imported, no new imports are required. We will also tighten get_csv_columns_from_path in the same way, because it has the same pattern: arbitrary absolute path from req["path"] going straight into open().
Concretely:
- Add `_ALLOWED_DATA_ROOT = _DATA_DIR` (or a sibling directory) where `_DATA_DIR` is defined.
- Add a helper `_resolve_user_file_path(raw_path: str) -> str` that:
  - Ensures `raw_path` is non-empty.
  - Expands `~`, obtains the `realpath`, and checks that its `commonpath` with `_ALLOWED_DATA_ROOT` equals `_ALLOWED_DATA_ROOT`.
  - Checks `os.path.isfile`.
  - Returns the safe path or raises `HTTPException(400, ...)` on violation.
- Replace in `get_csv_columns_from_path` and `load_dataset` the manual `expanduser`/`isabs`/`isfile` logic with calls to `_resolve_user_file_path`.

This keeps the external behavior (loading given paths in a controlled dataset directory) while preventing directory traversal or arbitrary file access outside the allowed root.
* feat: implement sampling configuration and record retrieval in Sampling component
* change sampling message
* feat: add sampling methods and integrate into sampling configuration
* Refactor entity comparison logic to support new entity status and source tracking
  - Updated EntityComparison component to handle multiple sources for entities and revised status types.
  - Enhanced logic for combining and classifying entities from Presidio, LLM, and predefined datasets.
  - Improved context retrieval for entities using indexOf for accurate highlighting.
  - Adjusted UI badges for entity statuses to reflect new terminology.
* Implement LLM Judge functionality in Anonymization page
  - Added state management for LLM Judge including model selection, connection status, and progress tracking.
  - Integrated API calls to fetch model configurations and analyze records.
  - Enhanced user feedback with loading indicators and error handling.
* Fetch and display sampled records in Human Review page
  - Implemented API calls to load records and LLM results on component mount.
  - Updated record handling to support dynamic data fetching and error management.
  - Improved UI to reflect loading states and error messages.
* Enhance dataset management in Setup page
  - Added functionality to fetch saved datasets from the backend on component mount.
  - Introduced fields for dataset name and description during dataset upload.
  - Implemented editing and deletion capabilities for existing datasets.
* Update types to include new dataset properties
  - Modified UploadedDataset interface to include name, description, and path fields.
* refactor: update entity status legend and remove conflict indication
…y in UI components
```python
if req.config_path:
    if not os.path.isabs(req.config_path):
        raise HTTPException(status_code=400, detail="Config path must be an absolute path.")
    if not os.path.isfile(req.config_path):
```
Check failure: Code scanning / CodeQL, "Uncontrolled data used in path expression" (High)

Copilot Autofix (AI, 20 days ago):
In general, the fix is to ensure that any filesystem path derived from user input is constrained to a safe directory and/or to an allow‑listed set of known paths. For this API, the safest approach without changing intended functionality is to only allow config_path values that point inside a designated data/config directory, or that exactly match one of the saved configuration paths stored in the existing _CONFIGS_FILE mechanism. This lets administrators choose from previously uploaded configs instead of arbitrary server paths.
Concretely, within configure_presidio, instead of accepting any absolute req.config_path, we can resolve it to a Path, normalize it, and then verify that it is (a) absolute and (b) under a known safe root directory used for Presidio configs. The code already uses a _DATA_DIR when saving uploaded configs and stores absolute file paths in the config registry; we can use that as the root and enforce resolved_path.is_relative_to(_DATA_DIR) (for Python 3.9+, or a manual prefix/ancestor check). If the path is not under _DATA_DIR, we reject the request with HTTP 400. We still keep the os.path.isfile check but apply it to the sanitized path. To avoid changing behavior more than necessary, we allow both the explicit config_path (if safe) and the previously existing “named configs” mechanism.
Implementation details:
- Ensure `_DATA_DIR` is defined in the same file (it already must be for `upload_config`; we just rely on it).
- In `configure_presidio`, replace the current `if req.config_path:` block (lines 207–211) with logic that:
  - Wraps `req.config_path` in `Path` and calls `.resolve()` to normalize.
  - Checks that the resolved path is within `_DATA_DIR` (via `is_relative_to` or a fallback).
  - Verifies that the file exists (`resolved_path.is_file()`).
  - Uses the sanitized string form of this resolved path going forward (assigned back to `req.config_path` or a local variable).
- This keeps functionality (the config is still loaded from a path) but ensures it is only loaded from our configs directory, eliminating arbitrary filesystem access.
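The `is_relative_to` containment check with its pre-3.9 fallback can be sketched in isolation (the paths are assumed to have been resolved already, and `is_within` is an invented name for this sketch):

```python
from pathlib import Path


def is_within(resolved_path: Path, data_dir: Path) -> bool:
    """Containment check with a fallback for Python versions before 3.9,
    where Path.is_relative_to does not exist."""
    try:
        return resolved_path.is_relative_to(data_dir)
    except AttributeError:
        # Manual ancestor check: the directory itself, or one of the parents
        resolved_dir = data_dir.resolve()
        return resolved_path == resolved_dir or resolved_dir in resolved_path.parents


data_dir = Path("/srv/presidio/data")
print(is_within(Path("/srv/presidio/data/config.yml"), data_dir))  # inside
print(is_within(Path("/etc/passwd"), data_dir))                    # outside
```

Note that `is_relative_to` is a purely lexical comparison, which is why the suggestion resolves the candidate with `.resolve()` first to defeat `..` segments and symlinks.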
```diff
@@ -205,11 +205,32 @@
     global _engine

     if req.config_path:
-        if not os.path.isabs(req.config_path):
-            raise HTTPException(status_code=400, detail="Config path must be an absolute path.")
-        if not os.path.isfile(req.config_path):
-            raise HTTPException(status_code=400, detail=f"Config file not found: {req.config_path}")
+        raw_path = req.config_path
+        try:
+            resolved_path = Path(raw_path).resolve()
+        except Exception:
+            raise HTTPException(status_code=400, detail="Invalid config path.")
+
+        # Ensure the config file resides within the Presidio data directory
+        try:
+            is_within_data_dir = resolved_path.is_relative_to(_DATA_DIR)
+        except AttributeError:
+            # Fallback for Python versions without Path.is_relative_to
+            try:
+                resolved_data_dir = _DATA_DIR.resolve()
+            except Exception:
+                raise HTTPException(status_code=500, detail="Server configuration error.")
+            is_within_data_dir = resolved_path == resolved_data_dir or resolved_data_dir in resolved_path.parents
+
+        if not is_within_data_dir:
+            raise HTTPException(status_code=400, detail="Config path must be inside the server config directory.")
+
+        if not resolved_path.is_file():
+            raise HTTPException(status_code=400, detail=f"Config file not found: {raw_path}")
+
+        # Use the normalized, safe path from here on
+        req.config_path = str(resolved_path)

     # Reset
     _engine = None
     _state["loading"] = True
```
… extras and add custom analyzer configuration
```python
    raise HTTPException(status_code=400, detail="Config path is required.")
if not os.path.isabs(path):
    raise HTTPException(status_code=400, detail="Config path must be absolute.")
if not os.path.isfile(path):
```
Check failure: Code scanning / CodeQL, "Uncontrolled data used in path expression" (High)

Copilot Autofix (AI, 20 days ago):
In general, to fix this class of problem, you should ensure that any path derived from user input is constrained to a safe location and/or sanitized. Common patterns include: (1) allowing only filenames and joining them with a fixed, trusted base directory; (2) normalizing and verifying that the resulting path stays within a designated root directory; or (3) enforcing an allow list of known-safe paths. Which one to use depends on how much flexibility is required.
For this endpoint, the best low-impact fix is to restrict configuration imports to a specific “configs root” directory on the server, and then verify that the requested path stays within that root after normalization. We can do this by: defining a _CONFIGS_ROOT directory (for example, under the project root’s configs or data/configs folder); joining the user-provided path to this root; normalizing it with os.path.normpath; and then checking that the resulting full path is both a file and resides under _CONFIGS_ROOT. Instead of requiring the client to send an absolute path, we then treat req.path as a relative path (or simple filename) under _CONFIGS_ROOT. Finally, shutil.copy2 must operate on this validated fullpath rather than the raw user string.
Concretely, in evaluation/ai-assistant/backend/routers/presidio_service.py:
- Add a module-level constant `_CONFIGS_ROOT` near `_DATA_DIR`, e.g. `Path(__file__).resolve().parent.parent / "configs"`.
- In `save_config`, stop requiring `os.path.isabs(path)`, and instead:
  - Reject path components that look like absolute paths or contain null bytes.
  - Build `fullpath = os.path.normpath(os.path.join(_CONFIGS_ROOT, path))`.
  - Ensure `fullpath` is inside `_CONFIGS_ROOT` (e.g. by comparing `fullpath.resolve()` to `_CONFIGS_ROOT.resolve()` via `is_relative_to` when available, or a safe `startswith` on path components).
  - Check `fullpath.is_file()` instead of `os.path.isfile(path)`.
- Use `fullpath` as the source argument to `shutil.copy2` instead of the unvalidated `path`.
This preserves the intended functionality (importing a config file into _DATA_DIR) while eliminating the ability to reference arbitrary filesystem paths.
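A minimal sketch of the relative-only variant described here, rejecting absolute input outright and joining everything under an assumed configs root (`resolve_config_path` and `CONFIGS_ROOT` are illustrative names; the real suggestion additionally uses `realpath` to defeat symlinks, which this sketch omits since the root need not exist):

```python
import os

CONFIGS_ROOT = "/srv/app/configs"  # assumed trusted configs root


def resolve_config_path(path: str) -> str:
    """Interpret path as relative to CONFIGS_ROOT and reject escapes."""
    if not path:
        raise ValueError("Config path is required.")
    # Absolute paths are disallowed outright
    if os.path.isabs(path):
        raise ValueError("Config path must be relative to the configs directory.")
    full = os.path.normpath(os.path.join(CONFIGS_ROOT, path))
    # Ensure the normalized path is still within the configs root
    if full != CONFIGS_ROOT and not full.startswith(CONFIGS_ROOT + os.sep):
        raise ValueError("Config path is not allowed.")
    return full


print(resolve_config_path("presidio/base.yml"))
```

Both `../secrets.yml` and `/etc/presidio.yml` are rejected by this scheme, while plain relative names resolve under the root.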
```diff
@@ -60,6 +60,7 @@
 # Per-config results accumulator (replaces run snapshots)
 # ---------------------------------------------------------------------------
 _DATA_DIR = Path(__file__).resolve().parent.parent / "data"
+_CONFIGS_ROOT = Path(__file__).resolve().parent.parent / "configs"

 # config_name -> {rec_id -> entities}
 _all_config_results: dict[str, dict[str, list]] = {}
@@ -123,16 +124,28 @@
         raise HTTPException(status_code=400, detail="Name may only contain letters, numbers, hyphens, underscores, dots, and spaces.")
     if not path:
         raise HTTPException(status_code=400, detail="Config path is required.")
-    if not os.path.isabs(path):
-        raise HTTPException(status_code=400, detail="Config path must be absolute.")
-    if not os.path.isfile(path):
+
+    # Treat the provided path as relative to a trusted configs root and
+    # validate that the normalized path stays within that root.
+    # Disallow absolute paths outright.
+    if os.path.isabs(path):
+        raise HTTPException(status_code=400, detail="Config path must be relative to the server configs directory.")
+    # Build and normalize the full path under the configs root
+    _CONFIGS_ROOT.mkdir(parents=True, exist_ok=True)
+    fullpath = os.path.normpath(os.path.join(str(_CONFIGS_ROOT), path))
+    # Ensure the normalized path is still within the configs root
+    configs_root_str = str(_CONFIGS_ROOT.resolve())
+    fullpath_resolved = os.path.realpath(fullpath)
+    if not fullpath_resolved.startswith(configs_root_str + os.sep) and fullpath_resolved != configs_root_str:
+        raise HTTPException(status_code=400, detail="Config path is not allowed.")
+    if not os.path.isfile(fullpath_resolved):
         raise HTTPException(status_code=400, detail=f"Config file not found: {path}")

     # Copy the config file into our data/ folder so we own the copy
     _DATA_DIR.mkdir(parents=True, exist_ok=True)
     safe_name = name.replace(" ", "_").replace("/", "_")
     dest = _DATA_DIR / f"config-{safe_name}.yml"
-    shutil.copy2(path, dest)
+    shutil.copy2(fullpath_resolved, dest)
     abs_path = str(dest.resolve())

     user_configs = _get_user_configs()
```
…in Welcome component
```python
_DATA_DIR.mkdir(parents=True, exist_ok=True)
safe_name = name.replace(" ", "_").replace("/", "_")
dest = _DATA_DIR / f"config-{safe_name}.yml"
shutil.copy2(path, dest)
```
Check failure: Code scanning / CodeQL, "Uncontrolled data used in path expression" (High)

Copilot Autofix (AI, 20 days ago):
In general, the fix is to restrict and validate any user-controlled path before using it for filesystem access. Common approaches are: (1) restrict paths to live under a specific trusted root directory and verify this after normalizing the path, or (2) maintain an allow list of permitted paths and reject everything else.
Here, we want to keep the current functionality of letting users “register” a config file, but without letting them point to arbitrary locations like /etc/shadow. The safest change, without altering the later behavior (which only ever reads from our own _DATA_DIR copies), is to restrict the source config file path to be under a known safe root. A natural candidate is the project’s data directory (_DATA_DIR’s parent or a subdirectory of it), or some other predetermined configs root. We can implement this by normalizing the input path with os.path.realpath, then checking that it is inside a chosen base directory using os.path.commonpath. If the check fails, we return a 400 error.
Concretely, in save_config (lines 117–143), after confirming that path is absolute and points to an existing file, we will:
- Normalize the user-supplied `path` via `os.path.realpath`.
- Define a trusted root directory for source configs, e.g. a `configs` subdirectory under `_DATA_DIR` (or `_DATA_DIR` itself, depending on your policy).
- Use `os.path.commonpath([trusted_root, normalized_path]) == trusted_root` to ensure the normalized path is within the trusted root.
- Reject paths outside that directory with `HTTPException(400, ...)`.

We will then use the normalized path (`normalized_path`) in the `shutil.copy2` call, instead of the raw `path`. This adds path traversal protection and prevents access to arbitrary filesystem locations, while preserving the existing behavior for valid config files in the allowed directory. All needed utilities (`os.path.realpath`, `os.path.commonpath`) are already available via the existing `import os`, so no new imports are required.
```diff
@@ -128,11 +128,17 @@
     if not os.path.isfile(path):
         raise HTTPException(status_code=400, detail=f"Config file not found: {path}")

+    # Restrict source config files to live under the data directory (or a subdirectory)
+    base_path = os.path.realpath(str(_DATA_DIR))
+    source_path = os.path.realpath(path)
+    if os.path.commonpath([base_path, source_path]) != base_path:
+        raise HTTPException(status_code=400, detail="Config path must be located under the allowed data directory.")
+
     # Copy the config file into our data/ folder so we own the copy
     _DATA_DIR.mkdir(parents=True, exist_ok=True)
     safe_name = name.replace(" ", "_").replace("/", "_")
     dest = _DATA_DIR / f"config-{safe_name}.yml"
-    shutil.copy2(path, dest)
+    shutil.copy2(source_path, dest)
     abs_path = str(dest.resolve())

     user_configs = _get_user_configs()
```
```python
_DATA_DIR.mkdir(parents=True, exist_ok=True)
safe_name = name.replace(" ", "_").replace("/", "_")
dest = _DATA_DIR / f"config-{safe_name}.yml"
shutil.copy2(path, dest)
```
Check failure: Code scanning / CodeQL, "Uncontrolled data used in path expression" (High)

Copilot Autofix (AI, 20 days ago): Copilot could not generate an autofix suggestion for this alert. Try pushing a new commit or, if the problem persists, contact support.
```python
safe_name = name.replace(" ", "_").replace("/", "_")
dest = _DATA_DIR / f"config-{safe_name}.yml"
shutil.copy2(path, dest)
abs_path = str(dest.resolve())
```
Check failure: Code scanning / CodeQL, "Uncontrolled data used in path expression" (High)

Copilot Autofix (AI, 20 days ago): Copilot could not generate an autofix suggestion for this alert. Try pushing a new commit or, if the problem persists, contact support.
```python
_DATA_DIR.mkdir(parents=True, exist_ok=True)
safe_name = name.replace(" ", "_").replace("/", "_")
dest = _DATA_DIR / f"config-{safe_name}.yml"
dest.write_bytes(content)
```
Check failure: Code scanning / CodeQL, "Uncontrolled data used in path expression" (High)

Copilot Autofix (AI, 20 days ago): Copilot could not generate an autofix suggestion for this alert. Try pushing a new commit or, if the problem persists, contact support.
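The alerts without an autofix all flow from `safe_name = name.replace(" ", "_").replace("/", "_")` into the destination filename. One option, not suggested by the tool, is a stricter allow-list sanitizer that rejects rather than rewrites suspicious names; `sanitize_config_name` is a hypothetical helper sketched under that assumption:

```python
import re


def sanitize_config_name(name: str) -> str:
    """Allow only a conservative character set for the stored filename;
    reject anything else instead of trying to rewrite it."""
    if not re.fullmatch(r"[A-Za-z0-9._ -]{1,64}", name):
        raise ValueError("Invalid config name.")
    # Collapse spaces the same way the original code does
    safe = name.replace(" ", "_")
    # Guard against dot-only names such as "." or ".."
    if safe.strip("._-") == "":
        raise ValueError("Invalid config name.")
    return safe


print(sanitize_config_name("My Config v2"))  # -> "My_Config_v2"
```

Because `/`, `\`, and NUL are excluded from the allowed set, the sanitized name cannot introduce extra path components when formatted into `f"config-{safe_name}.yml"`.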
```python
safe_name = name.replace(" ", "_").replace("/", "_")
dest = _DATA_DIR / f"config-{safe_name}.yml"
dest.write_bytes(content)
abs_path = str(dest.resolve())
```
Check failure: Code scanning / CodeQL, "Uncontrolled data used in path expression" (High)

Copilot Autofix (AI, 20 days ago): Copilot could not generate an autofix suggestion for this alert. Try pushing a new commit or, if the problem persists, contact support.
```python
safe_ds_name = display_name.replace(" ", "_").replace("/", "_")
stored_filename = f"{safe_ds_name}_{uid}{ext}"
stored_path = os.path.join(_DATA_DIR, stored_filename)
shutil.copy2(file_path, stored_path)
```
Check failure: Code scanning / CodeQL, "Uncontrolled data used in path expression" (High)

Copilot Autofix (AI, 20 days ago):
In general, paths derived from user input must be constrained to a safe directory tree. The usual pattern is: define a fixed root directory, normalize the user-supplied path relative to that root, and reject any path that resolves outside the root. For absolute paths, you typically either (a) disallow them, or (b) normalize them and verify they live under the allowed root before accessing them.
For this code, the safest minimal fix that preserves behavior is:
- Introduce a dedicated data root for user-loadable files (e.g., under the existing `_DATA_DIR` or a sibling directory).
- Add a helper `_resolve_user_path` that:
  - Takes the raw `req.path` (after `expanduser`).
  - Normalizes it with `os.path.realpath`.
  - Joins it to the allowed root if it is not absolute, or at least checks that the resulting real path starts with the allowed root path.
  - Rejects paths that fall outside the allowed root with an HTTP 400/403.
- Use this helper instead of the current direct `expanduser` + `isabs` logic in `load_dataset`, and also in `get_csv_columns_from_path` for consistency, so that both endpoints only operate on files inside the allowed root.
- Keep the rest of the logic unchanged (copying to `_DATA_DIR`, size checks, parsing, etc.).
Concretely:
- In `evaluation/ai-assistant/backend/routers/upload.py`, define a new constant like `_IMPORT_ROOT` that points to a dedicated directory under the project root (e.g., `os.path.join(_PROJECT_ROOT, "import")`), and ensure it exists (`os.makedirs(..., exist_ok=True)`).
- Define a new function `_resolve_user_path(raw_path: str) -> str` near `_resolve_path` that:
  - Validates presence of `raw_path`.
  - Expands `~`.
  - If the resulting path is absolute, uses it directly; if it is relative, joins it to `_IMPORT_ROOT`.
  - Normalizes with `os.path.realpath`.
  - Verifies it starts with `_IMPORT_ROOT` (using `os.path.commonpath` or a prefix check on normalized paths).
  - Raises `HTTPException` if invalid or outside the root.
- Update `get_csv_columns_from_path` and `load_dataset` to call `_resolve_user_path(...)` instead of manually working with user-controlled paths. Keep checks like `os.path.isfile` and size limits the same, but applied to the resolved safe path.
- Ensure no other changes to business logic (dataset naming, registry saving, etc.).
```python
safe_ds_name = display_name.replace(" ", "_").replace("/", "_")
stored_filename = f"{safe_ds_name}_{uid}{ext}"
stored_path = os.path.join(_DATA_DIR, stored_filename)
shutil.copy2(file_path, stored_path)
```
Check failure — Code scanning / CodeQL: Uncontrolled data used in path expression (High)

Copilot Autofix:
In general, to fix uncontrolled data in path expressions where you want to allow some flexibility but keep files within a specific directory, you should normalize the final path and then verify that it remains inside a designated root directory. This usually means combining a fixed base directory with a user-influenced filename, normalizing with os.path.normpath (or os.path.realpath), and then checking that the normalized path starts with the expected base directory (plus a path separator, to avoid prefix tricks).
For this specific code, the issue is around how stored_path is constructed from display_name (user input) and then passed into shutil.copy2. We already have _DATA_DIR as a fixed storage directory, and we already lightly sanitize the display name and validate it with _validate_name. The best minimal fix is to additionally normalize and validate stored_path itself, ensuring it cannot escape _DATA_DIR even if a future change weakens _validate_name or path semantics differ across platforms.
Concretely, in load_dataset:
- Keep the generation of `safe_ds_name` and `stored_filename` as-is to preserve current naming behavior.
- After building `stored_path = os.path.join(_DATA_DIR, stored_filename)`, compute a normalized absolute path, e.g. `normalized_stored_path = os.path.normpath(os.path.abspath(stored_path))`.
- Do the same for `_DATA_DIR` (once, or inline here) and verify that `normalized_stored_path` is within `_DATA_DIR`. A straightforward, cross-platform-safe way that avoids prefix edge cases is `os.path.commonpath([normalized_stored_path, _DATA_DIR]) == os.path.normpath(_DATA_DIR)`.
- If the check fails, raise an `HTTPException` with a 400 code indicating that the dataset name is invalid.
- Use `normalized_stored_path` for the copy and when saving `stored_path` into `UploadedDataset`.
All changes are confined to `evaluation/ai-assistant/backend/routers/upload.py` around lines 336–350; no new imports are needed because the `os` module is already imported.
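The reason `os.path.commonpath` is preferred over a naive `startswith` is the sibling-directory prefix pitfall; a quick illustration, with `/srv/app/data` standing in for the real data directory:

```python
import os

DATA_DIR = os.path.normpath("/srv/app/data")  # illustrative root

def is_inside(path: str) -> bool:
    # Normalize first so '..' segments cannot defeat the check
    normalized = os.path.normpath(os.path.abspath(path))
    return os.path.commonpath([normalized, DATA_DIR]) == DATA_DIR
```

A raw prefix test would accept `/srv/app/data-evil/x` because the string starts with `/srv/app/data`; `is_inside` rejects it, along with `..` escapes.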
```diff
@@ -337,7 +337,11 @@
     safe_ds_name = display_name.replace(" ", "_").replace("/", "_")
     stored_filename = f"{safe_ds_name}_{uid}{ext}"
     stored_path = os.path.join(_DATA_DIR, stored_filename)
-    shutil.copy2(file_path, stored_path)
+    normalized_data_dir = os.path.normpath(os.path.abspath(_DATA_DIR))
+    normalized_stored_path = os.path.normpath(os.path.abspath(stored_path))
+    if os.path.commonpath([normalized_stored_path, normalized_data_dir]) != normalized_data_dir:
+        raise HTTPException(status_code=400, detail="Invalid dataset name.")
+    shutil.copy2(file_path, normalized_stored_path)

     description = req.description.strip() if req.description else ""
     dataset = UploadedDataset(
@@ -346,7 +350,7 @@
         name=display_name,
         description=description,
         path=file_path,
-        stored_path=stored_path,
+        stored_path=normalized_stored_path,
         format=req.format,
         record_count=len(records),
         has_entities=has_entities,
```
```python
safe_ds_name = display_name.replace(" ", "_").replace("/", "_")
stored_filename = f"{safe_ds_name}_{uid}.csv"
stored_path = os.path.join(_DATA_DIR, stored_filename)
with open(stored_path, "w", encoding="utf-8") as f:
```
Check failure — Code scanning / CodeQL: Uncontrolled data used in path expression (High)
```python
stored_filename = f"{safe_name}_{dataset_id}{ext}"
stored_path = os.path.join(_DATA_DIR, stored_filename)

shutil.copy2(resolved, stored_path)
```
Check failure — Code scanning / CodeQL: Uncontrolled data used in path expression (High)

Copilot Autofix:
In general, to fix uncontrolled path usage we must (1) restrict all file operations to a known safe root, (2) normalize paths before checking them, and (3) ensure filenames are sanitized. In this code, the primary issues are that _resolve_path allows arbitrary absolute paths (and relative paths that may traverse upwards), and that ds.name and dataset_id are used in the stored filename without extra validation. We can address all the variants by making _resolve_path enforce that any dataset path resolves into a dedicated datasets root directory, and by tightening the sanitization of the generated stored filename. Because _ensure_records_loaded and the /download route both rely on _ensure_stored_copy and _resolve_path, strengthening those two places will automatically cover all the CodeQL variants.
Concretely, we can:
- Introduce a dedicated datasets root directory under the backend (e.g. `backend/data/datasets`) derived from `_DATA_DIR`, and ensure it exists.
- Update `_resolve_path` so it always resolves against this datasets root:
  - Compute `candidate = os.path.normpath(os.path.join(_DATASETS_ROOT, path))`.
  - Reject the path if `os.path.isabs(path)` or if `..` is used to escape the root (checked via `os.path.commonpath`).
  - Return the safe normalized `candidate`.

  This prevents absolute paths and directory traversal while still allowing relative dataset paths inside the datasets root.
- Harden the stored filename generation in `_ensure_stored_copy` by restricting characters:
  - Derive a `safe_name` from `ds.name` using a regex (similar to `_NAME_RE`) or a simple whitelist of alphanumerics, space, dot, underscore, and hyphen; convert anything else to `_`.
  - Similarly sanitize `dataset_id` so even if it is user-controlled, it cannot inject path separators or special characters.
  - Build `stored_filename = f"{safe_name}_{safe_id}{ext}"` and join with `_DATA_DIR` as before.
These changes stay within the provided files, only add basic os.path.commonpath usage (no new dependencies), and do not change the external API; they only reject malicious or malformed dataset paths and filenames.
```diff
@@ -46,7 +46,11 @@
 _DATA_DIR = os.path.join(os.path.dirname(os.path.dirname(__file__)), "data")
 os.makedirs(_DATA_DIR, exist_ok=True)

+# Root directory for dataset source files (restrict all resolved paths to this tree)
+_DATASETS_ROOT = os.path.join(_DATA_DIR, "datasets")
+os.makedirs(_DATASETS_ROOT, exist_ok=True)
+

 def _save_registry() -> None:
     """Persist the dataset registry to disk."""
     data = [ds.model_dump() for ds in _uploaded.values()]
@@ -75,12 +78,40 @@


 def _resolve_path(path: str) -> str:
-    """Resolve a path; relative paths are resolved against the project root."""
+    """Resolve a dataset path safely under the datasets root directory.
+
+    Absolute paths and paths that escape the configured datasets root are rejected.
+    """
+    # Reject absolute paths to avoid pointing outside the managed datasets tree
     if os.path.isabs(path):
-        return path
-    return os.path.normpath(os.path.join(_PROJECT_ROOT, path))
+        raise HTTPException(
+            status_code=400,
+            detail="Absolute paths are not allowed for dataset files.",
+        )
+
+    # Resolve relative path against the dedicated datasets root
+    candidate = os.path.normpath(os.path.join(_DATASETS_ROOT, path))
+
+    # Ensure the normalized path stays within the datasets root (prevents '..' escape)
+    datasets_root_norm = os.path.normpath(_DATASETS_ROOT)
+    try:
+        common = os.path.commonpath([datasets_root_norm, candidate])
+    except ValueError:
+        # On path-type mismatch, treat as invalid
+        raise HTTPException(
+            status_code=400,
+            detail="Invalid dataset path.",
+        )
+
+    if common != datasets_root_norm:
+        raise HTTPException(
+            status_code=400,
+            detail="Dataset path escapes the allowed datasets directory.",
+        )
+
+    return candidate


 def _ensure_stored_copy(dataset_id: str) -> UploadedDataset:
     """Ensure a dataset has a managed copy under backend/data."""
     ds = _uploaded.get(dataset_id)
@@ -98,8 +126,10 @@
     )

     ext = os.path.splitext(resolved)[1] or ".csv"
-    safe_name = ds.name.replace(" ", "_").replace("/", "_")
-    stored_filename = f"{safe_name}_{dataset_id}{ext}"
+    # Sanitize dataset name and id to avoid injecting path separators or special chars
+    safe_name = re.sub(r"[^A-Za-z0-9_.\-]+", "_", ds.name.strip()) or "dataset"
+    safe_id = re.sub(r"[^A-Za-z0-9_.\-]+", "_", dataset_id)
+    stored_filename = f"{safe_name}_{safe_id}{ext}"
     stored_path = os.path.join(_DATA_DIR, stored_filename)

     shutil.copy2(resolved, stored_path)
```
```python
    raise HTTPException(status_code=400, detail="Path must be absolute.")
if not os.path.isfile(file_path):
    raise HTTPException(status_code=400, detail=f"File not found: {file_path}")
with open(file_path, encoding="utf-8") as f:
```
Check failure — Code scanning / CodeQL: Uncontrolled data used in path expression (High)

Copilot Autofix:
In general, to fix uncontrolled path usage you must (1) define a safe root directory for any file access based on user input, (2) normalize the combined path using os.path.normpath (and preferably os.path.realpath), and (3) verify the normalized path is within the safe root before passing it to filesystem APIs like open, os.path.isfile, etc. Optionally, also restrict the allowed file extension (.csv here).
For this specific function, the least intrusive, backwards‑compatible fix is to require that the caller’s path lies under an allowed base directory (for example _PROJECT_ROOT or a dedicated _DATA_DIR) and to normalize the path before use. We can reuse _PROJECT_ROOT that already exists in this file to implement a simple _resolve_and_validate_path helper that expands ~, resolves relative paths against _PROJECT_ROOT, normalizes via os.path.realpath, and then enforces that the result is still under _PROJECT_ROOT (using a prefix check on the normalized absolute path). Then get_csv_columns_from_path should call this helper instead of using os.path.expanduser directly and should apply the existing .csv check already used in get_csv_columns. We only need to modify the _resolve_path helper and the get_csv_columns_from_path endpoint in this file; no new imports are required.
Concretely:
- Update `_resolve_path` to normalize and secure paths:
  - Expand `~` with `os.path.expanduser`.
  - If the input is absolute, normalize/realpath it.
  - If relative, join it to `_PROJECT_ROOT`, then realpath it.
  - Verify the resulting path starts with `_PROJECT_ROOT` (with a separator boundary).
  - Raise `HTTPException(400, "Access to this path is not allowed.")` if the check fails.
- In `get_csv_columns_from_path`:
  - Replace `file_path = os.path.expanduser(req.get("path", ""))` and the absolute-path requirement with `file_path = _resolve_path(req.get("path", ""))`.
  - Add a `.csv` extension check consistent with the upload-based endpoint.
  - Keep the `os.path.isfile` check, but now operating on the validated `file_path`.
This preserves the basic behavior (“give me columns from this path”) but constrains it to the project tree and only CSV files, eliminating arbitrary filesystem read capability.
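With path validation factored out, the endpoint body reduces to an extension check plus a header read; a stdlib-only sketch under those assumptions (`read_csv_header` is an illustrative name, not from the router):

```python
import csv

def read_csv_header(file_path: str) -> list[str]:
    """Return the stripped column names from a CSV header row."""
    if not file_path.lower().endswith(".csv"):
        raise ValueError("Only .csv files are accepted.")
    with open(file_path, encoding="utf-8", newline="") as f:
        # next(..., []) yields an empty header for an empty file
        header = next(csv.reader(f), [])
    return [col.strip() for col in header]
```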
```diff
@@ -75,10 +75,26 @@


 def _resolve_path(path: str) -> str:
-    """Resolve a path; relative paths are resolved against the project root."""
+    """Resolve and validate a path under the project root.
+
+    - Expands '~'
+    - Resolves relative paths against the project root
+    - Normalizes and resolves symlinks
+    - Ensures the resulting path stays within the project root
+    """
+    if not path:
+        raise HTTPException(status_code=400, detail="Path must not be empty.")
+    # Expand user home and normalize
+    path = os.path.expanduser(path)
     if os.path.isabs(path):
-        return path
-    return os.path.normpath(os.path.join(_PROJECT_ROOT, path))
+        candidate = os.path.realpath(path)
+    else:
+        candidate = os.path.realpath(os.path.join(_PROJECT_ROOT, path))
+    project_root_real = os.path.realpath(_PROJECT_ROOT)
+    # Ensure the candidate path is within the project root
+    if not (candidate == project_root_real or candidate.startswith(project_root_real + os.sep)):
+        raise HTTPException(status_code=400, detail="Access to this path is not allowed.")
+    return candidate


 def _ensure_stored_copy(dataset_id: str) -> UploadedDataset:
@@ -264,10 +279,11 @@

 @router.post("/columns-from-path")
 async def get_csv_columns_from_path(req: dict):
-    """Read the header row of a CSV at the given absolute path."""
-    file_path = os.path.expanduser(req.get("path", ""))
-    if not file_path or not os.path.isabs(file_path):
-        raise HTTPException(status_code=400, detail="Path must be absolute.")
+    """Read the header row of a CSV at the given path under the project root."""
+    raw_path = req.get("path", "")
+    file_path = _resolve_path(raw_path)
+    if not file_path.lower().endswith(".csv"):
+        raise HTTPException(status_code=400, detail="Only .csv files are accepted.")
     if not os.path.isfile(file_path):
         raise HTTPException(status_code=400, detail=f"File not found: {file_path}")
     with open(file_path, encoding="utf-8") as f:
```
```python
file_path = os.path.expanduser(req.path)
if not os.path.isabs(file_path):
    raise HTTPException(status_code=400, detail="Path must be absolute.")
if not os.path.isfile(file_path):
```
Check failure — Code scanning / CodeQL: Uncontrolled data used in path expression (High)

Copilot Autofix:
In general, to fix this type of issue you must ensure that any filesystem path derived from user input is validated against a safe root directory (or an explicit allow‑list) after normalization. The usual pattern is: take the user path, expand user (~) if needed, join it to a fixed root, normalize the result with os.path.normpath or os.path.realpath, and then check that the final path is still inside the root. Only then use the path in open, os.path.isfile, os.path.getsize, etc.
In this codebase, the single best fix with minimal functional change is to require that all user-specified paths for /columns-from-path and /load live under the existing _DATA_DIR (the managed backend/data directory). We already have _DATA_DIR and _PROJECT_ROOT; we can implement a small helper that resolves user input against _DATA_DIR safely:
- Create a function `_resolve_safe_path(user_path: str) -> str` that:
  - Rejects empty paths.
  - If `os.path.isabs(user_path)`, strips any leading path separator and treats it as relative (so `/foo.csv` becomes `foo.csv`), or simply rejects absolute inputs; the most conservative change is to reject them.
  - Joins the (possibly relative) `user_path` with `_DATA_DIR`.
  - Normalizes the result with `os.path.normpath`.
  - Verifies that the normalized path starts with `_DATA_DIR` plus a path separator (or equals `_DATA_DIR` exactly) to prevent `..` escaping.
  - Raises `HTTPException(400, ...)` if validation fails.
  - Returns the safe path otherwise.
- `get_csv_columns_from_path` (lines 265–279): instead of using `os.path.expanduser` and requiring an absolute path, call `_resolve_safe_path` on the user-supplied `path`, then perform `os.path.isfile`, `open`, etc. on the safe path.
- `load_dataset` (lines 288–312): similarly, derive `file_path` from `_resolve_safe_path(req.path)` instead of `os.path.expanduser` and the absolute-path check. The rest of the logic (size check, parsing) stays unchanged.
This keeps existing endpoint semantics (loading from local files) but limits them to files under the application’s data directory, eliminating arbitrary filesystem access while avoiding new external dependencies.
```diff
@@ -81,6 +81,28 @@
     return os.path.normpath(os.path.join(_PROJECT_ROOT, path))


+def _resolve_safe_path(user_path: str) -> str:
+    """
+    Resolve a user-provided path safely under the local data directory.
+
+    The resulting path is normalised and validated to ensure it stays within
+    the configured _DATA_DIR, preventing directory traversal and access to
+    unexpected filesystem locations.
+    """
+    if not user_path:
+        raise HTTPException(status_code=400, detail="Path must not be empty.")
+    # Treat the user input as relative to the data directory
+    # to avoid exposing arbitrary filesystem locations.
+    # This also prevents use of absolute paths like "/etc/passwd".
+    relative = user_path.lstrip(os.sep)
+    candidate = os.path.normpath(os.path.join(_DATA_DIR, relative))
+    data_dir_norm = os.path.normpath(_DATA_DIR)
+    # Ensure the resolved path is within the data directory
+    if not (candidate == data_dir_norm or candidate.startswith(data_dir_norm + os.sep)):
+        raise HTTPException(status_code=400, detail="Path is outside the allowed data directory.")
+    return candidate
+
+
 def _ensure_stored_copy(dataset_id: str) -> UploadedDataset:
     """Ensure a dataset has a managed copy under backend/data."""
     ds = _uploaded.get(dataset_id)
@@ -264,10 +286,9 @@

 @router.post("/columns-from-path")
 async def get_csv_columns_from_path(req: dict):
-    """Read the header row of a CSV at the given absolute path."""
-    file_path = os.path.expanduser(req.get("path", ""))
-    if not file_path or not os.path.isabs(file_path):
-        raise HTTPException(status_code=400, detail="Path must be absolute.")
+    """Read the header row of a CSV at the given path under the data directory."""
+    raw_path = req.get("path", "")
+    file_path = _resolve_safe_path(raw_path)
     if not os.path.isfile(file_path):
         raise HTTPException(status_code=400, detail=f"File not found: {file_path}")
     with open(file_path, encoding="utf-8") as f:
@@ -287,7 +308,7 @@

 @router.post("/load")
 async def load_dataset(req: DatasetLoadRequest):
-    """Load a CSV or JSON file from a local absolute path."""
+    """Load a CSV or JSON file from a local path under the data directory."""
     if req.format not in ("csv", "json"):
         raise HTTPException(
             status_code=400,
@@ -297,9 +318,7 @@
         ),
     )

-    file_path = os.path.expanduser(req.path)
-    if not os.path.isabs(file_path):
-        raise HTTPException(status_code=400, detail="Path must be absolute.")
+    file_path = _resolve_safe_path(req.path)
     if not os.path.isfile(file_path):
         raise HTTPException(status_code=400, detail=f"File not found: {file_path}")
```
Change Description
Presidio Evaluation Flow
This PR introduces the Presidio Evaluation Flow, a new interactive tool under `evaluation/ai-assistant/` that guides users through a human-in-the-loop PII detection evaluation process. Use the `run.md` file to run it in your environment.
This is the main evaluation branch. PRs will continue to be merged here until the full evaluation flow is complete.
The UI was designed as a Figma Make prototype and converted to working code using the Figma MCP server.
Frontend: React + TypeScript + Vite + Tailwind CSS v4 + shadcn/ui + Recharts
Backend: Python + FastAPI + Poetry + Pydantic v2
Next Steps
- `entity_type` to match Presidio Analyzer's `RecognizerResult` format.
- Sampling: random (`pandas.sample()` with a fixed seed) and length-based (stratified by text-length terciles into short/medium/long buckets). The frontend now lets users pick the method via radio buttons, with semantic diversity shown as coming soon.
- Integrate `presidio-evaluator` for real precision/recall/F1 calculation against the golden set.

Screenshots
Checklist